Background: Decided to participate in a 36 hour ‘over the weekend’ data science hackathon. No aspirations of placing, just learning. What I’m presenting here is my interpretation of the question, and approach to the problem. (Unfortunately, there was some ambiguity as to the actual structure of the problem*)
Basically a two-parter. Part A You, a data scientist at an insurance company, are trying to predict the probability of policy renewal for each customer (standard ‘churn’ problem, logistic regression). The interesting part was Part B the optimization, and my main focus here. You can spend extra (**“Incentives) to have agents pay extra attention to clients, and thereby gain an Increase in Probability** of renewal. BUT, this relationship was not linear.
pacman::p_load(tidyverse, plotROC, pROC, DMwR2, plotly)
fun.3 <- function(x) .2*(1-exp(-2*(1-exp(-x/400))))
p <- ggplot(data=data.frame(x=0), mapping = aes(x=x))
q <- p + stat_function(fun = fun.3)+xlim(0,1500)+ylim(0,.175)+
xlab('Incentives Spent to Increase Probability')+
ylab('Increase in Probability')
ggplotly(q)
The goal is to maximize Total Net Revenue, defined by: the updated Probability, P(baseline + incentive bump) times the Premium on the policy minus whatever you spent on Incentives (summed across all rows).
My approach: Work out the relationships between variables using function approximation. If/when I come across a more elegant/correct way to do the optimization, I’ll update this page or write a part 2.
I’ll ignore for now, since the dataset isn’t public, and there is no longer a way to check your score with their metric. Scoring was a weighted average that favored the AUC (of the ROC) for your Part A solution (70% of score), and the other ~30% of score was calculated from your total net revenue.
A simple logistic regression on untransformed variables performed fairly well in this case. (predictor variables included: age, number of premiums paid on time/late, income, and underwriter/credit score).
The classifier isn’t the focus here, so I’ll just show the quick and dirty code for Part A. The DMwR2 package in R makes kNN imputation quite easy - as there were a handful of NAs in both the training and test sets.
train <- read.csv('train.csv')
trainCC <- knnImputation(train[,-c(1)])
lr0 <- glm(renewal ~ ., data = trainCC, family = 'binomial')
lr0pred <- predict(lr0, trainCC, type = 'resp')
trainCC$lr0pred <- lr0pred
ggplot(trainCC, aes(d=renewal, m = lr0pred))+geom_roc()
auc <- roc(trainCC$renewal, lr0pred); auc # on training
##
## Call:
## roc.default(response = trainCC$renewal, predictor = lr0pred)
##
## Data: lr0pred in 4998 controls (trainCC$renewal 0) < 74855 cases (trainCC$renewal 1).
## Area under the curve: 0.8317
Sometime soon I’ll make a companion post exploring how to optimize the binary classifier model. But since the best AUC I saw was ~0.85, I call that darn good for a first-pass. FWIW, feature engineering/transforms did little to enhance the model, according to several of the top-ranked participants, and my own fiddling (quite frustrating!).
The first step was to figure out the optimum ‘ROI’ for Incentives, across the range of policy premiums. The smart way is to plot the equation, where z = ROI, x = Incentives, and y = Premium. It gives you a 3d surface - find the equation that defines the maximum for each x,y and you’re all set.
fun.3d <- function(x,y) (.2*(1-exp(-2*(1-exp(-x/400))))*y) - x
Except I don’t have the math skills to make that happen. The plot sure is pretty though…
incentives <- seq(0,1000,10)
premium <- seq(1200,60000, length.out=101)
roi <- as.matrix(read.csv('surface.csv', header = F))
plot_ly(x=~incentives,
y=~premium,
z=~roi) %>% add_surface()
What I did instead was turned to (old faithful) Excel, and plotted a poor man’s 3d surface. We already know the relationship between Incentive and Increase in Probability, so we can multiply the latter by a range of premium amounts, and deduct the incentives paid to calculate the ROI (see below).
The highlighted table in yellow is a list of values we can now use for curve fitting in R, to get a good approximation of that relationship. The output of that fit will tell us the optimal Incentive for each value of Premium.
ideal <- read.csv('ideal.csv', header=T)
colnames(ideal) <- c('premium', 'incentive')
tail(ideal)
## premium incentive
## 10 20000 580
## 11 25000 650
## 12 30000 700
## 13 40000 790
## 14 50000 860
## 15 60000 920
range(ideal$incentive)
## [1] 20 920
ggplot(ideal, aes(x=premium, y=incentive))+geom_point()
Now, this part will not make sense to most people - but since my background is in Pharmacology, I thought “Gee that looks an awful lot like a sigmoidal dose response curve”, so I download the drc package in R to fit the data. I guess you’ll just have to trust me that it makes sense (and I’m sure there are other packages and formulas that would approximate the fit just fine).
library(drc); library(modelr)
drc <- drm(incentive ~ premium, data = ideal, fct = LL.4(), type = 'continuous')
rsquare(drc, ideal)
## [1] 0.9997965
plot(drc, log = '', cex = 2)
Looks pretty close. Now we can call predict() to get the optimal Incentive as a function of Premium!
Now.. What about people who are already very likely to renew their policy? This formula works to find the maximum amount to spend on Incentive, as a function of Premium, above which we are not maximizing our ROI. But people with low P(baseline) (Probability of Renewal) are a relatively small portion of the dataset (~6%). Performing well overall will require optimizing ROI from the majority who are already near P(baseline)=1.
I noticed from the Excel sheet that spending less than the optimal amount never resulted in a LOSS (negative ROI). In other words, it is always worth spending a little to bump up your P(baseline) to 1, even if it was already very high.
Therefore, the first step on the test set is to calculate how much ‘room’ for improvement there is before you hit P(renewal) = 1. That is simply 1 - P(baseline) – which again is just the output of our logistic regression model.
test <- read.csv('test.csv') #read in
testCC <- knnImputation(test[,-c(1)]) # don't use id for knn
testCC <- cbind(id = test$id, testCC) # add it back
testPred <- predict(lr0, testCC, type = 'response') # get P(baseline) for test set
testCC$pred <- testPred # add back to dataframe
testCC$room <- 1 - testCC$pred # calculate room for improvement
OK, next is where we run into a problem: We have a new maximum for how much we should incentivize, but in units of Y (based on the relationship we already have for those variables; see fun.3 above).
AND, the equation is ridiculously complicated, to the point that Wolfram Alpha can’t even solve for X. We could go back to school for a math degree, OR… use another function approximation from observable data points.
y = seq(0,1000, 10) # range of incentives, new Y
deltaP <- fun.3(y) # Increase Prob for range of incentives, new X
solveX <- data.frame(x=deltaP, y=y)
solveX <- solveX[2:101,] # get rid of origin, zeros
tail(solveX)
## x y
## 96 0.1673989 950
## 97 0.1675483 960
## 98 0.1676933 970
## 99 0.1678342 980
## 100 0.1679709 990
## 101 0.1681038 1000
plot((solveX$x), log(solveX$y))
Wild. Ok, looks like a 3rd order polynomial might be able to approximate.
model <- lm(log(y) ~ x + I(x^2) + I(x^3) -1, data=solveX) # -1 = no intercept term, force thru origin
summary(model)
##
## Call:
## lm(formula = log(y) ~ x + I(x^2) + I(x^3) - 1, data = solveX)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26135 -0.03762 -0.01638 0.06507 1.11193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 134.896 2.901 46.50 <2e-16 ***
## I(x^2) -1213.932 45.699 -26.56 <2e-16 ***
## I(x^3) 3904.845 176.162 22.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1812 on 97 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
## F-statistic: 3.666e+04 on 3 and 97 DF, p-value: < 2.2e-16
Well, I’m satisfied with the R-squared. Now we have a way to go from the ‘room’ variable (room for improvement before you hit probability of 1) to Incentive.
model$coefficients
## x I(x^2) I(x^3)
## 134.8965 -1213.9316 3904.8455
fun.model <- function(x) exp( (model$coefficients[1]*x )+
(model$coefficients[2]*(x^2))+
(model$coefficients[3]*(x^3)) )
We’ll use this to calculate what the incentive should be, when the ‘room’ for improvement is less than what the ideal incentive (only a function of premium) would suggest.
Here is the code to finish these calculations on the test set to get our predictions. If you followed along so far, this should be no problem.
testX <- testCC[,c(1,12:14)] # dplyr select knitr fail
testX <- testX %>%
mutate(ideal_incentive = predict(drc, test[c('id','premium')])) %>% # optimum based on premium
mutate(room_incentive = fun.model(pmin(room, 0.1669))) %>% # max spend dictated by headroom
mutate(incentives = trunc(pmin(ideal_incentive, room_incentive))) %>% # take the lower value
mutate(new_prob = fun.3(incentives) + pred) %>% # verify
filter(new_prob >= 1) %>% # uh oh?
# select(id, premium, pred, room) %>% # failing in knitr
arrange(desc(new_prob))
head(testX, 10)
## id premium pred room ideal_incentive room_incentive
## 1 73714 26400 0.9243101 0.07568995 665.6823 141.0298
## 2 97179 13800 0.9250901 0.07490988 488.7091 139.0090
## 3 68806 7500 0.9239201 0.07607991 343.1634 142.0297
## 4 103903 7500 0.9235314 0.07646865 343.1634 143.0195
## 5 17160 22200 0.9242994 0.07570062 616.3146 141.0572
## 6 2374 5400 0.9235211 0.07647888 273.4742 143.0455
## 7 6956 5400 0.9239052 0.07609477 273.4742 142.0676
## 8 42637 18000 0.9254659 0.07453412 558.5628 138.0259
## 9 45410 7500 0.9254658 0.07453423 343.1634 138.0262
## 10 32156 5700 0.9239029 0.07609708 284.5246 142.0735
## incentives new_prob
## 1 141 1.013903
## 2 139 1.013902
## 3 142 1.013900
## 4 143 1.013896
## 5 141 1.013892
## 6 143 1.013885
## 7 142 1.013885
## 8 138 1.013884
## 9 138 1.013884
## 10 142 1.013883
So, the caveat is that because my function approximations weren’t perfect, I’ve got a few predicted overall probabilities (after incentives) that are a hair over 1.0. There is no mechanism in place to detect or penalize for that (aside from lost revenue), so you could leave them as-is, or scale down the incentives by a certain percentage (seems to happen mostly when the premium is within a certain range).
nrow(testX) / nrow(testCC)
## [1] 0.18937
summary(testX$incentives)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 86.0 126.0 128.3 167.0 692.0
Of me: I got distracted as the deadline approached, and never submitted for a final score. I had made submissions that were on the top ~1/3 of the leaderboard before figuring out the optimization part, but.. the platform for this particular competition is a dud for people coming to learn:
A bit anti-climactic I suppose. Lesson learned: Stick to Kaggle if you’re learning.
I’d love to know the ‘right’ way to go about this problem, so if you have an opinion or alternate strategy, even a link to a useful tutorial, please share!